Preliminaries

We’ll be use Rstudio and R for this tutorial.

Learning goals and objectives

  • Use GEOquery to access gene expression data
  • Make a heatmap of the gene expression data
  • Perform Principal Components Analysis
  • Use the GGally and ggplot2 packages to view the PCA results in biological context.

Installing needed packages

In particular, we’ll be using the following packages:

  • ggplot2
  • GEOquery
  • SummarizedExperiment
  • ComplexHeatmap
  • GGally

The following code just installs any packages that are not already installed on your system.

pkgs = c(
  "ggplot2",
  "GEOquery",
  "SummarizedExperiment",
  "ComplexHeatmap",
  "GGally"
)
ins = installed.packages(repos = BiocManager::repositories())
for(pkg in pkgs) {
  if(!(pkg %in% rownames(ins)))
    BiocManager::install(pkg)
}

Using GEOquery to load data

Recall that NCBI GEO is a large database that contains publicly available and free gene expression data. We’ll assume here that you have already explored the literature and the GEO data available and have settled on the dataset entitled, “Gene expression profiles of breast, colorectal, prostate, and non-small cell lung cancer.” The dataset details are available here:

The associated publication, “Regulatory T-cell Genes Drive Altered Immune Microenvironment in Adult Solid Cancers and Allow for Immune Contextual Patient Subtyping,” is available here:

Once we have identified a dataset of interest, we can access it using GEOquery.

library(GEOquery)
# The [[1]] in the following line it because
# some GEO records include more than one dataset.
# In this cae, we just want the first (and only) 
# dataset. 
gse = getGEO("GSE103512")[[1]]

GEOquery was written many years ago. Since that time, the Bioconductor data structures for storing gene expression data have been improved and updated. The following code chunk just updates the gene expression data to use the new SummarizedExperiment container.

library(SummarizedExperiment)
se = as(gse, "SummarizedExperiment")

We can get a quick look at what is included by simply printing the se variable:

print(se)
## class: SummarizedExperiment 
## dim: 54715 280 
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(54715): 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-5_at
##   AFFX-TrpnX-M_at
## rowData names(16): ID GB_ACC ... Gene.Ontology.Cellular.Component
##   Gene.Ontology.Molecular.Function
## colnames(280): GSM2772660 GSM2772661 ... GSM2772938 GSM2772939
## colData names(72): title geo_accession ... tumorsize.ch1 weight.ch1

Questions

From the output of the print(se) statement:

  • How many samples are included?
  • How many genes are included?

Hint: Look at the dimensions listed in the output.

Explore dataset

Remember that the data in Bioconductor are now in the form of a SummarizedExperiment. You can read more about SummarizedExperiments here.

An overview of the data structure is portrayed in the following figure.

In particular, we might want to look at the information about the samples. To do so, we can use the colData method.

colData(se)
## DataFrame with 280 rows and 72 columns
##                             title geo_accession                status
##                       <character>   <character>           <character>
## GSM2772660       BC_coex_path_001    GSM2772660 Public on Sep 07 2017
## GSM2772661       BC_coex_path_002    GSM2772661 Public on Sep 07 2017
## GSM2772662       BC_coex_path_003    GSM2772662 Public on Sep 07 2017
## GSM2772663       BC_coex_path_004    GSM2772663 Public on Sep 07 2017
## GSM2772664       BC_coex_path_005    GSM2772664 Public on Sep 07 2017
## ...                           ...           ...                   ...
## GSM2772935 PC_coex_path_002_nor..    GSM2772935 Public on Sep 07 2017
## GSM2772936 PC_coex_path_003_nor..    GSM2772936 Public on Sep 07 2017
## GSM2772937 PC_coex_path_009_nor..    GSM2772937 Public on Sep 07 2017
## GSM2772938 PC_coex_path_004_nor..    GSM2772938 Public on Sep 07 2017
## GSM2772939 PC_coex_path_010_nor..    GSM2772939 Public on Sep 07 2017
##            submission_date last_update_date        type channel_count
##                <character>      <character> <character>   <character>
## GSM2772660     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772661     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772662     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772663     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772664     Sep 05 2017      Jan 23 2018         RNA             1
## ...                    ...              ...         ...           ...
## GSM2772935     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772936     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772937     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772938     Sep 05 2017      Jan 23 2018         RNA             1
## GSM2772939     Sep 05 2017      Jan 23 2018         RNA             1
##                   source_name_ch1 organism_ch1 characteristics_ch1
##                       <character>  <character>         <character>
## GSM2772660 Total RNA Seq profil.. Homo sapiens     cancer type: BC
## GSM2772661 Total RNA Seq profil.. Homo sapiens     cancer type: BC
## GSM2772662 Total RNA Seq profil.. Homo sapiens     cancer type: BC
## GSM2772663 Total RNA Seq profil.. Homo sapiens     cancer type: BC
## GSM2772664 Total RNA Seq profil.. Homo sapiens     cancer type: BC
## ...                           ...          ...                 ...
## GSM2772935 Total RNA Seq profil.. Homo sapiens    cancer type: PCA
## GSM2772936 Total RNA Seq profil.. Homo sapiens    cancer type: PCA
## GSM2772937 Total RNA Seq profil.. Homo sapiens    cancer type: PCA
## GSM2772938 Total RNA Seq profil.. Homo sapiens    cancer type: PCA
## GSM2772939 Total RNA Seq profil.. Homo sapiens    cancer type: PCA
##            characteristics_ch1.1 characteristics_ch1.2 characteristics_ch1.3
##                      <character>           <character>           <character>
## GSM2772660            normal: no         batch i/ii: I         histology: na
## GSM2772661            normal: no         batch i/ii: I         histology: na
## GSM2772662            normal: no         batch i/ii: I         histology: na
## GSM2772663            normal: no         batch i/ii: I         histology: na
## GSM2772664            normal: no         batch i/ii: I         histology: na
## ...                          ...                   ...                   ...
## GSM2772935           normal: yes        batch i/ii: II         histology: na
## GSM2772936           normal: yes        batch i/ii: II         histology: na
## GSM2772937           normal: yes        batch i/ii: II         histology: na
## GSM2772938           normal: yes        batch i/ii: II         histology: na
## GSM2772939           normal: yes        batch i/ii: II         histology: na
##            characteristics_ch1.4 characteristics_ch1.5 characteristics_ch1.6
##                      <character>           <character>           <character>
## GSM2772660               age: na            gender: na            weight: na
## GSM2772661               age: na            gender: na            weight: na
## GSM2772662               age: na            gender: na            weight: na
## GSM2772663               age: na            gender: na            weight: na
## GSM2772664               age: na            gender: na            weight: na
## ...                          ...                   ...                   ...
## GSM2772935               age: na            gender: na            weight: na
## GSM2772936               age: na            gender: na            weight: na
## GSM2772937               age: na            gender: na            weight: na
## GSM2772938               age: na            gender: na            weight: na
## GSM2772939               age: na            gender: na            weight: na
##            characteristics_ch1.7 characteristics_ch1.8 characteristics_ch1.9
##                      <character>           <character>           <character>
## GSM2772660        bodyheight: na               bmi: na       diseasecode: na
## GSM2772661        bodyheight: na               bmi: na       diseasecode: na
## GSM2772662        bodyheight: na               bmi: na       diseasecode: na
## GSM2772663        bodyheight: na               bmi: na       diseasecode: na
## GSM2772664        bodyheight: na               bmi: na       diseasecode: na
## ...                          ...                   ...                   ...
## GSM2772935        bodyheight: na               bmi: na       diseasecode: na
## GSM2772936        bodyheight: na               bmi: na       diseasecode: na
## GSM2772937        bodyheight: na               bmi: na       diseasecode: na
## GSM2772938        bodyheight: na               bmi: na       diseasecode: na
## GSM2772939        bodyheight: na               bmi: na       diseasecode: na
##            characteristics_ch1.10 characteristics_ch1.11 characteristics_ch1.12
##                       <character>            <character>            <character>
## GSM2772660 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772661 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772662 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772663 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772664 diseasedescription: na primary / relapse tu..          tumorsize: na
## ...                           ...                    ...                    ...
## GSM2772935 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772936 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772937 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772938 diseasedescription: na primary / relapse tu..          tumorsize: na
## GSM2772939 diseasedescription: na primary / relapse tu..          tumorsize: na
##            characteristics_ch1.13 characteristics_ch1.14 characteristics_ch1.15
##                       <character>            <character>            <character>
## GSM2772660       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772661       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772662       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772663       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772664       ischemiatime: na       tumorcontent: na              organ: na
## ...                           ...                    ...                    ...
## GSM2772935       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772936       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772937       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772938       ischemiatime: na       tumorcontent: na              organ: na
## GSM2772939       ischemiatime: na       tumorcontent: na              organ: na
##            characteristics_ch1.16 characteristics_ch1.17 characteristics_ch1.18
##                       <character>            <character>            <character>
## GSM2772660  tumorlocalization: na              Stage: na         radicality: na
## GSM2772661  tumorlocalization: na              Stage: na         radicality: na
## GSM2772662  tumorlocalization: na              Stage: na         radicality: na
## GSM2772663  tumorlocalization: na              Stage: na         radicality: na
## GSM2772664  tumorlocalization: na              Stage: na         radicality: na
## ...                           ...                    ...                    ...
## GSM2772935  tumorlocalization: na              Stage: na         radicality: na
## GSM2772936  tumorlocalization: na              Stage: na         radicality: na
## GSM2772937  tumorlocalization: na              Stage: na         radicality: na
## GSM2772938  tumorlocalization: na              Stage: na         radicality: na
## GSM2772939  tumorlocalization: na              Stage: na         radicality: na
##            characteristics_ch1.19 characteristics_ch1.20 treatment_protocol_ch1
##                       <character>            <character>            <character>
## GSM2772660            grading: na            dignity: na Tumor samples were f..
## GSM2772661            grading: na            dignity: na Tumor samples were f..
## GSM2772662            grading: na            dignity: na Tumor samples were f..
## GSM2772663            grading: na            dignity: na Tumor samples were f..
## GSM2772664            grading: na            dignity: na Tumor samples were f..
## ...                           ...                    ...                    ...
## GSM2772935            grading: na            dignity: na Tumor samples were f..
## GSM2772936            grading: na            dignity: na Tumor samples were f..
## GSM2772937            grading: na            dignity: na Tumor samples were f..
## GSM2772938            grading: na            dignity: na Tumor samples were f..
## GSM2772939            grading: na            dignity: na Tumor samples were f..
##               growth_protocol_ch1 molecule_ch1   extract_protocol_ch1
##                       <character>  <character>            <character>
## GSM2772660 Samples were extract..    total RNA Extraction was done ..
## GSM2772661 Samples were extract..    total RNA Extraction was done ..
## GSM2772662 Samples were extract..    total RNA Extraction was done ..
## GSM2772663 Samples were extract..    total RNA Extraction was done ..
## GSM2772664 Samples were extract..    total RNA Extraction was done ..
## ...                           ...          ...                    ...
## GSM2772935 Samples were extract..    total RNA Extraction was done ..
## GSM2772936 Samples were extract..    total RNA Extraction was done ..
## GSM2772937 Samples were extract..    total RNA Extraction was done ..
## GSM2772938 Samples were extract..    total RNA Extraction was done ..
## GSM2772939 Samples were extract..    total RNA Extraction was done ..
##              label_ch1     label_protocol_ch1   taxid_ch1
##            <character>            <character> <character>
## GSM2772660      biotin Biotin-Labeling and ..        9606
## GSM2772661      biotin Biotin-Labeling and ..        9606
## GSM2772662      biotin Biotin-Labeling and ..        9606
## GSM2772663      biotin Biotin-Labeling and ..        9606
## GSM2772664      biotin Biotin-Labeling and ..        9606
## ...                ...                    ...         ...
## GSM2772935      biotin Biotin-Labeling and ..        9606
## GSM2772936      biotin Biotin-Labeling and ..        9606
## GSM2772937      biotin Biotin-Labeling and ..        9606
## GSM2772938      biotin Biotin-Labeling and ..        9606
## GSM2772939      biotin Biotin-Labeling and ..        9606
##                      hyb_protocol          scan_protocol        data_processing
##                       <character>            <character>            <character>
## GSM2772660 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772661 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772662 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772663 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772664 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## ...                           ...                    ...                    ...
## GSM2772935 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772936 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772937 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772938 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772939 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
##            platform_id  contact_name     contact_department
##            <character>   <character>            <character>
## GSM2772660    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772661    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772662    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772663    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772664    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## ...                ...           ...                    ...
## GSM2772935    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772936    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772937    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772938    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772939    GPL13158 Wei-Yi,,Cheng Pharma Research and ..
##                 contact_institute        contact_address contact_city
##                       <character>            <character>  <character>
## GSM2772660 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772661 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772662 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772663 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772664 Roche Innovation Cen.. Roche Translational ..     New York
## ...                           ...                    ...          ...
## GSM2772935 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772936 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772937 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772938 Roche Innovation Cen.. Roche Translational ..     New York
## GSM2772939 Roche Innovation Cen.. Roche Translational ..     New York
##            contact_state contact_zip.postal_code contact_country
##              <character>             <character>     <character>
## GSM2772660      New York                   10016             USA
## GSM2772661      New York                   10016             USA
## GSM2772662      New York                   10016             USA
## GSM2772663      New York                   10016             USA
## GSM2772664      New York                   10016             USA
## ...                  ...                     ...             ...
## GSM2772935      New York                   10016             USA
## GSM2772936      New York                   10016             USA
## GSM2772937      New York                   10016             USA
## GSM2772938      New York                   10016             USA
## GSM2772939      New York                   10016             USA
##                supplementary_file data_row_count     age.ch1 batch.i.ii.ch1
##                       <character>    <character> <character>    <character>
## GSM2772660 ftp://ftp.ncbi.nlm.n..          54715          na              I
## GSM2772661 ftp://ftp.ncbi.nlm.n..          54715          na              I
## GSM2772662 ftp://ftp.ncbi.nlm.n..          54715          na              I
## GSM2772663 ftp://ftp.ncbi.nlm.n..          54715          na              I
## GSM2772664 ftp://ftp.ncbi.nlm.n..          54715          na              I
## ...                           ...            ...         ...            ...
## GSM2772935 ftp://ftp.ncbi.nlm.n..          54715          na             II
## GSM2772936 ftp://ftp.ncbi.nlm.n..          54715          na             II
## GSM2772937 ftp://ftp.ncbi.nlm.n..          54715          na             II
## GSM2772938 ftp://ftp.ncbi.nlm.n..          54715          na             II
## GSM2772939 ftp://ftp.ncbi.nlm.n..          54715          na             II
##                bmi.ch1 bodyheight.ch1 cancer.type.ch1 dignity.ch1
##            <character>    <character>     <character> <character>
## GSM2772660          na             na              BC          na
## GSM2772661          na             na              BC          na
## GSM2772662          na             na              BC          na
## GSM2772663          na             na              BC          na
## GSM2772664          na             na              BC          na
## ...                ...            ...             ...         ...
## GSM2772935          na             na             PCA          na
## GSM2772936          na             na             PCA          na
## GSM2772937          na             na             PCA          na
## GSM2772938          na             na             PCA          na
## GSM2772939          na             na             PCA          na
##            diseasecode.ch1 diseasedescription.ch1  gender.ch1 grading.ch1
##                <character>            <character> <character> <character>
## GSM2772660              na                     na          na          na
## GSM2772661              na                     na          na          na
## GSM2772662              na                     na          na          na
## GSM2772663              na                     na          na          na
## GSM2772664              na                     na          na          na
## ...                    ...                    ...         ...         ...
## GSM2772935              na                     na          na          na
## GSM2772936              na                     na          na          na
## GSM2772937              na                     na          na          na
## GSM2772938              na                     na          na          na
## GSM2772939              na                     na          na          na
##            histology.ch1 ischemiatime.ch1  normal.ch1   organ.ch1
##              <character>      <character> <character> <character>
## GSM2772660            na               na          no          na
## GSM2772661            na               na          no          na
## GSM2772662            na               na          no          na
## GSM2772663            na               na          no          na
## GSM2772664            na               na          no          na
## ...                  ...              ...         ...         ...
## GSM2772935            na               na         yes          na
## GSM2772936            na               na         yes          na
## GSM2772937            na               na         yes          na
## GSM2772938            na               na         yes          na
## GSM2772939            na               na         yes          na
##            primary...relapse.tumor.ch1 radicality.ch1   Stage.ch1
##                            <character>    <character> <character>
## GSM2772660                          na             na          na
## GSM2772661                          na             na          na
## GSM2772662                          na             na          na
## GSM2772663                          na             na          na
## GSM2772664                          na             na          na
## ...                                ...            ...         ...
## GSM2772935                          na             na          na
## GSM2772936                          na             na          na
## GSM2772937                          na             na          na
## GSM2772938                          na             na          na
## GSM2772939                          na             na          na
##            tumorcontent.ch1 tumorlocalization.ch1 tumorsize.ch1  weight.ch1
##                 <character>           <character>   <character> <character>
## GSM2772660               na                    na            na          na
## GSM2772661               na                    na            na          na
## GSM2772662               na                    na            na          na
## GSM2772663               na                    na            na          na
## GSM2772664               na                    na            na          na
## ...                     ...                   ...           ...         ...
## GSM2772935               na                    na            na          na
## GSM2772936               na                    na            na          na
## GSM2772937               na                    na            na          na
## GSM2772938               na                    na            na          na
## GSM2772939               na                    na            na          na

That is a lot of output to look through. In Rstudio, we can use the interactive viewer to look through the sample information more easily.

View(colData(se))

Questions

  • What column contains the tumor type?
  • What column contains information about whether the sample is a tumor or not?

I’ve searched through the columns and picked out two columns that contain the information I want to use:

  • Tumor type: cancer.type.ch1
  • Normal/tumor status: normal.ch1

To look at the breakdown of samples based on both the tumor type and whether the sample was derived from the tumor itself or non-tumor tissue nearby. The table function can do that.

with(colData(se),table(`cancer.type.ch1`,`normal.ch1`))
##                normal.ch1
## cancer.type.ch1 no yes
##           BC    65  10
##           CRC   57  12
##           NSCLC 60   9
##           PCA   60   7

Questions

  • How many Breast Cancer (BC) samples are there?
  • How many Breast Cancer (BC) tumors are not normal (are tumors)?

Hint: Tumor types are in rows and normal/tumor status is in columns.

Creating a heatmap

Recall that a “heatmap” is a combination of three pieces to make a plot of gene expression data.

  1. Clustering of samples
  2. Clustering of genes
  3. A “heatmap” of each gene expression value for each sample.

In some cases, this kind of plot will show a structure of blocks of color that may represent genes that are different between sets of samples. Without the clustering of samples and genes, such structure in the data will not be visible.

Data preparation

Before creating a heatmap, I am going to narrow down the data a bit to include the top 500 most variable genes. The rational for doing so is two-fold.

  1. The most variable genes are the ones that carry the most information about the differences between the samples.
  2. This one is pragmatic: viewing thousands of genes in the rows of a heatmap is impossible without a REALLY big page.

The following block of code uses the sd function on each row to calculate the Standard Deviation (abbreviated “sd”). Then, we select the 500 genes with the largest standard deviation

## ----sds,cache=TRUE,echo=TRUE-------------------------------------------------
sds = apply(assay(se, 'exprs'),1,sd)
dat = assay(se, 'exprs')[order(sds,decreasing = TRUE)[1:500],]

The variable dat is a matrix that contains the resulting top 500 most variable genes for each sample. We can double-check by looking at the dimensions of dat.

dim(dat)
## [1] 500 280

I also want to create a small dataset that contains the tumor type and the tumor/normal status. I’ll create that using the following code:

sampdat = data.frame(
  type=factor(colData(se)[,'cancer.type.ch1']),
  isnormal = factor(colData(se)[,'normal.ch1'])
)

We can take a quick look at these data:

head(sampdat)
##   type isnormal
## 1   BC       no
## 2   BC       no
## 3   BC       no
## 4   BC       no
## 5   BC       no
## 6   BC       no
summary(sampdat)
##     type    isnormal 
##  BC   :75   no :242  
##  CRC  :69   yes: 38  
##  NSCLC:69            
##  PCA  :67

Create and customize the heatmap

We’ll be using the ComplexHeatmap package to produce our heatmap.

library(ComplexHeatmap)

We can start with a basic heatmap of our top 500 most variable genes and all samples.

Heatmap(dat,show_row_names = FALSE, show_column_names = FALSE)

Questions

  • What do the colors in the heatmap cells represent?
  • Do you see any blocks of color? If so, what do they mean to you and your analysis? (Hint: would you always expect such blocks?)

Let’s try to help ourselves a bit by adding some biological variables to the sample columns.

ha = HeatmapAnnotation(df=sampdat)
Heatmap(dat,top_annotation = ha,
        show_column_names = FALSE,
        show_row_names = FALSE)

Questions

Now that you have biological annotations for each sample, answer this question again:

  • Do you see any blocks of color? If so, what do they mean to you and your analysis?

Heatmap wrapup

  • While we used gene expression data here, any “matrix” of numbers can be viewed as a heatmap.
  • Heatmaps can help you discover structure in your data that you might not see otherwise.
  • By adding annotation (such as tumor type) to the heatmap, you can begin to formulate hypotheses about the association between the rows, columns, and important variables in your data.

Principal Components Analysis

For this section, we’ll again use the most variable genes that we defined at the beginning of the heatmap section. Performing a principal component analysis in R starts with a matrix as input. In this case, that input matrix, dat contains gene expression data. But it could be any matrix that you want to simplify by reducing dimensionality.

The main function for doing principal components analysis in R is the prcomp function. In the following block of code, I’m going to go ahead and apply it to our dat matrix.

# rakk=10 limits calculation to only 10 principal components. This is only for convenience.
pcres = prcomp(dat,rank. = 10)

Summary of PC results

We can quickly look at a summary of the results:

summary(pcres)
## Importance of first k=10 (out of 280) components:
##                            PC1     PC2      PC3     PC4     PC5     PC6    PC7
## Standard deviation     28.0114 15.5878 11.06765 9.17050 6.07968 5.35715 4.9017
## Proportion of Variance  0.5096  0.1578  0.07955 0.05462 0.02401 0.01864 0.0156
## Cumulative Proportion   0.5096  0.6674  0.74695 0.80157 0.82558 0.84422 0.8598
##                            PC8     PC9   PC10
## Standard deviation     3.79590 3.43745 3.1151
## Proportion of Variance 0.00936 0.00767 0.0063
## Cumulative Proportion  0.86918 0.87685 0.8832

This summary has a lot of useful information. In particular, the Proportion of Variance is how much of the variation of our entire dataset is captured by each PC. For example, if we summarize our gene expression dataset with the first PC, we will have captured more than 50% of the total variation. The first seven PCs each capture more than 1% of the total variation. After that, the PCs capture less than that. The top 10 PCs capture a total of 88.32% of the variation (Cumulative Proportion).

Plotting PCs

I’m going to manipulate the PCA results into a data.frame so that we can plot. Just take my word for it right now.

pcdf = data.frame(sampdat,pcres$rotation)
summary(pcdf)
##     type    isnormal       PC1               PC2           
##  BC   :75   no :242   Min.   :0.02484   Min.   :-0.139694  
##  CRC  :69   yes: 38   1st Qu.:0.04870   1st Qu.:-0.011519  
##  NSCLC:69             Median :0.05936   Median : 0.015814  
##  PCA  :67             Mean   :0.05842   Mean   :-0.008278  
##                       3rd Qu.:0.06865   3rd Qu.: 0.032701  
##                       Max.   :0.08138   Max.   : 0.062973  
##       PC3                  PC4                 PC5           
##  Min.   :-0.1240729   Min.   :-0.127509   Min.   :-0.198953  
##  1st Qu.:-0.0440144   1st Qu.:-0.031096   1st Qu.:-0.020829  
##  Median : 0.0054298   Median : 0.005786   Median :-0.005820  
##  Mean   :-0.0005879   Mean   : 0.005708   Mean   :-0.001857  
##  3rd Qu.: 0.0306895   3rd Qu.: 0.056669   3rd Qu.: 0.014221  
##  Max.   : 0.1143564   Max.   : 0.117534   Max.   : 0.175172  
##       PC6                 PC7                 PC8            
##  Min.   :-0.137764   Min.   :-0.212817   Min.   :-0.1347058  
##  1st Qu.:-0.046031   1st Qu.:-0.033061   1st Qu.:-0.0391158  
##  Median :-0.015361   Median : 0.001675   Median : 0.0019482  
##  Mean   :-0.005747   Mean   :-0.002336   Mean   : 0.0001779  
##  3rd Qu.: 0.018247   3rd Qu.: 0.033663   3rd Qu.: 0.0312022  
##  Max.   : 0.191060   Max.   : 0.181785   Max.   : 0.1860653  
##       PC9                  PC10          
##  Min.   :-0.1705955   Min.   :-0.183806  
##  1st Qu.:-0.0357108   1st Qu.:-0.033834  
##  Median :-0.0006643   Median : 0.002368  
##  Mean   :-0.0009023   Mean   : 0.001586  
##  3rd Qu.: 0.0338716   3rd Qu.: 0.038594  
##  Max.   : 0.1492061   Max.   : 0.161559
head(pcdf)
##            type isnormal        PC1          PC2         PC3          PC4
## GSM2772660   BC       no 0.06170998 0.0118201202 -0.07277076  0.026097242
## GSM2772661   BC       no 0.04823465 0.0143191336 -0.04327190  0.014975967
## GSM2772662   BC       no 0.07530785 0.0140713629 -0.05830088  0.004662037
## GSM2772663   BC       no 0.06470332 0.0009215445 -0.08254243  0.037084681
## GSM2772664   BC       no 0.07562771 0.0243711057 -0.06538010 -0.006604001
## GSM2772665   BC       no 0.06645876 0.0100085881 -0.05425063  0.024063192
##                     PC5         PC6          PC7        PC8          PC9
## GSM2772660 -0.044013499  0.06995323 -0.004717639 0.12908292 -0.070579681
## GSM2772661 -0.070268681 -0.03519182 -0.050319551 0.13667807 -0.006154981
## GSM2772662 -0.036916960  0.07085866 -0.007843776 0.12549398 -0.007211999
## GSM2772663  0.006212287  0.08922814 -0.038393757 0.04796887 -0.165708025
## GSM2772664  0.008372729  0.02561096  0.049302349 0.09439716 -0.063215452
## GSM2772665 -0.023214048  0.05424151 -0.022572000 0.09366333 -0.016355000
##                   PC10
## GSM2772660  0.01606562
## GSM2772661  0.06852335
## GSM2772662 -0.02794572
## GSM2772663 -0.13703616
## GSM2772664 -0.13337958
## GSM2772665 -0.00746967

With our data.frame in hand that includes the sample information and the PC values for each sample, we can use ggplot2 to make some plots

library(ggplot2)
ggplot(pcdf,aes(x=PC1,y=PC2)) + geom_point()

This plot isn’t very helpful since we cannot see the biological variables of interest. Let’s add them:

ggplot(pcdf,aes(x=PC1,y=PC2,color=type,shape=isnormal)) + geom_point()

Exercise

  • Change the variable on the X and Y axis to look at three or four combinations of “PC1”, “PC2”, “PC3”, etc.

To look at multiple variables at once, we can use the GGally package to create multiple PCs in the the same plot.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(pcdf,columns=3:8,
        mapping=aes(color=type,
                    alpha=0.5)) # alpha makes things a little transparent for easier viewing

PCA wrapup

The PCA part of things is really meant to give a taste of PC analysis and not to make us all PCA masters. When presented with large datasets, consider reading up and using PCA.